Enhancing Falcon Model Performance with Amazon SageMaker

When it comes to optimizing the performance of large language models (LLMs) for text generation in generative AI applications, choosing the right framework and configuration can be quite challenging. This task is complicated by factors such as model size, diverse architectures, and specific application performance needs. The Amazon SageMaker Large Model Inference (LMI) container simplifies the deployment of LLMs by integrating various frameworks and techniques designed to enhance performance. Notably, the LMI container features a robust serving stack called DJL Serving that is agnostic to the LLM being used. It allows for system-level configuration adjustments that can be fine-tuned to maximize the efficiency of the hosting infrastructure for specific LLMs. Additionally, it incorporates advanced optimizations like continuous batching, which can significantly boost throughput.

In a previous article, we demonstrated how to utilize the LMI container to deploy the Falcon series of models on SageMaker. In this article, we explore methods to enhance the throughput and reduce latency of serving Falcon-40B using continuous batching techniques. We also provide an accessible overview of configuration parameters available within the SageMaker LMI container to help you optimize your settings for real-world applications.

Basics of Text Generation Inference for LLMs

To start, let’s review some essential concepts regarding inference for LLMs in text generation.

Forward Pass, Activations, and the KV Cache

When an input sequence of tokens is fed into an LLM (such as Falcon), it undergoes a forward pass through all of the model's layers to generate the next token. A forward pass refers to processing input data through a neural network to produce an output. In text generation, this means feeding an initial seed or context into the language model and repeating the process to produce subsequent tokens until the desired text length is reached.

LLMs such as Falcon or GPT utilize an autoregressive approach, generating tokens one at a time while conditioning on previously produced tokens. During this autoregressive decoding, all input tokens generate corresponding attention key and value tensors, which are stored in GPU memory to assist in producing subsequent tokens. These stored tensors are commonly referred to as the KV cache.
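To see what the KV cache looks like in practice, the following minimal sketch (not from the original post) runs a single forward pass with a Hugging Face Transformers causal LM and inspects the cached key and value tensors; the model ID tiiuae/falcon-7b and the prompt are placeholders used only for illustration.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: any Hugging Face causal LM exposes the KV cache this way;
    # tiiuae/falcon-7b is used here purely as an example.
    model_id = "tiiuae/falcon-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    inputs = tokenizer("Deploying Falcon on SageMaker is", return_tensors="pt")

    with torch.no_grad():
        # One forward pass over the prompt; use_cache=True returns the attention
        # key/value tensors (the KV cache) alongside the logits.
        outputs = model(**inputs, use_cache=True)

    next_token_logits = outputs.logits[:, -1, :]   # logits used to pick the next token
    keys, values = outputs.past_key_values[0]      # cached key/value tensors for the first layer
    print(keys.shape)                              # one cached position per input token so far

Each generated token appends one position to these tensors, which is why the KV cache grows with sequence length and with batch size, and why it occupies a meaningful share of GPU memory during serving.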

Prefill and Decode Phases

In autoregressive decoding, two main phases are crucial for generating coherent and contextually relevant text: the prefill phase and the decode phase.

Prefill Phase:
  • Initial Context: The process begins with the user-provided prompt (seed text), which serves as the starting point for text generation.
  • Model Conditioning: The model processes all prompt tokens together in a single forward pass, conditioning itself on the full input.
  • KV Cache Population: The attention key and value tensors for every prompt token are computed and stored in the KV cache.
  • First Token: The phase ends when the model emits the first generated token.
Decode Phase:
  • Continuation: Starting from the last generated token, the model produces the next token, conditioning on all preceding tokens via the KV cache.
  • Iterative Completion: Generation proceeds one token at a time, with each new token appended to the context and its key and value tensors added to the cache.
  • Coherence: Because every step conditions on the full preceding sequence, the completion stays coherent and grammatically consistent with the prompt.
  • Stopping Condition: Generation terminates when a stopping condition is met, such as reaching a maximum length or emitting an end-of-sequence token.

The combination of these two phases allows autoregressive models to produce coherent, contextually relevant text sequences.
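The following sketch separates the two phases explicitly; it assumes Hugging Face Transformers, greedy decoding, and the placeholder model ID tiiuae/falcon-7b, and it illustrates the idea rather than how the LMI container implements it.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "tiiuae/falcon-7b"  # placeholder; any causal LM behaves the same way
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()

    prompt_ids = tokenizer("Amazon SageMaker is", return_tensors="pt").input_ids

    with torch.no_grad():
        # Prefill phase: process the whole prompt in one forward pass,
        # populating the KV cache and producing the first new token.
        out = model(prompt_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = [next_id]

        # Decode phase: generate one token at a time, reusing and extending the KV cache.
        for _ in range(32):  # stopping condition: at most 32 new tokens
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated.append(next_id)
            if next_id.item() == tokenizer.eos_token_id:  # or stop at end-of-sequence
                break

    print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))

Note that the prefill pass touches many tokens at once and is compute-bound, while each decode step touches a single token per request and is typically memory-bound, which is what makes batching during decoding so valuable.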

Refer to A Distributed Serving System for Transformer-Based Generative Models for a comprehensive explanation of the process.

Optimizing Throughput with Dynamic Batching

So far, we’ve discussed a single input; in practice, application clients send multiple concurrent requests. Traditional batching can increase throughput and make better use of GPU resources: it combines multiple requests into a single batch so that their autoregressive forward passes run in parallel, with the aggregation handled on the serving side. The SageMaker LMI’s DJLServing server can be configured to dynamically batch requests using the following parameters in serving.properties:

  • max_batch_delay = 100 – This sets the maximum delay for batch aggregation to 100 milliseconds.
  • batch_size = 32 – This defines the dynamic batch size, defaulting to 1.

With these settings, DJLServing will queue requests for a maximum of 100 milliseconds or until the specified batch size is reached, at which point it processes the batch for inference. The dynamic nature of this batching allows for variations in request volume. However, since requests may differ in characteristics (e.g., some may have 20 tokens of input and 500 output), processing times can vary, leading to potential underutilization of GPU resources.
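For reference, a serving.properties that enables this behavior could look like the following sketch; batch_size and max_batch_delay are the two parameters discussed above, while the engine, model ID, and tensor parallel degree are illustrative placeholders rather than recommended values.

    # serving.properties (illustrative values)
    engine=Python
    option.model_id=tiiuae/falcon-40b
    option.tensor_parallel_degree=4
    # Dynamic batching: wait up to 100 ms or until 32 requests are queued,
    # whichever comes first, then run the batch through the model.
    batch_size=32
    max_batch_delay=100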
